Add human scoring workflow #39
Open
Swiftyos wants to merge 4 commits into
Adds an end-to-end "Score" surface for human review of completed runs: a new persisted human_dimension_scores table mirroring judge_dimension_scores, HTTP routes that drain an unscored backlog one chat at a time, and a React dashboard view with rubric/objective/tool-call sidebars and Pearson-correlation pills against the LLM judge scores. Replaces the legacy inline dashboard with the built React bundle as the only frontend, and adds a one-shot seed-test-scores script for retargeting old data onto the new product rubric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The PostgresRepository.listPresets path uses `sql.unsafe(RUN_SUMMARY_COLUMNS)` inline inside a tagged template to interpolate the column list. JS evaluates that call eagerly before the tagged template runs, so the mock's `sql.unsafe` was being invoked with just the column list and throwing because the text did not match any "from <table>" branch. Make `sql.unsafe` return an inert empty result for fragment-style calls instead of throwing; the parent template still records the real query string so the existing query-count assertions hold.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-run of `docs:quality` and `docs:workspace` after the test fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
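For reviewers unfamiliar with the evaluation order described in the second commit message, here is a minimal sketch. It is not the project's actual mock: `makeMockSql`, the `calls` log, and the `COLUMNS` string are hypothetical names used only to show that an interpolated `sql.unsafe(...)` call runs before the tag function sees the template, and that returning an inert empty result keeps the outer query recordable.

```typescript
// Hypothetical stand-in for a tagged-template SQL mock (illustration only).
function makeMockSql() {
  const calls: string[] = []; // invocation log, in order

  // Fragment-style call: no "from <table>" to match, so return an inert
  // empty result instead of throwing.
  const unsafe = (text: string): unknown[] => {
    calls.push(`unsafe:${text}`);
    return [];
  };

  // The tag function receives the literal parts plus already-evaluated values.
  const tag = (strings: TemplateStringsArray, ..._values: unknown[]): unknown[] => {
    calls.push(`tag:${strings.join("?").trim()}`);
    return [];
  };

  return { sql: Object.assign(tag, { unsafe }), calls };
}

const { sql, calls } = makeMockSql();
const COLUMNS = "id, name"; // assumed column list, not RUN_SUMMARY_COLUMNS

// The interpolated expression is evaluated first, then the tag is called.
const rows = sql`select ${sql.unsafe(COLUMNS)} from presets`;
```

After this runs, `calls` holds the `unsafe` entry first and the `tag` entry second, which is why a mock that throws on unrecognized `unsafe` text fails before the real query is ever recorded.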
Intent
Adds the human scoring workflow so reviewers can manually score completed scenario runs against rubric dimensions and compare those scores with automated judge output. The branch includes dashboard UI, HTTP API routes, SQLite/Postgres persistence support, migrations, and a seed script for test scores.
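The first commit message describes the comparison with judge output as Pearson-correlation pills. As a rough illustration of that statistic, and not the dashboard's actual code, the sketch below computes Pearson's r for two equal-length score arrays. The function name `pearson` and the choice to return `null` for undefined cases (fewer than two pairs, or zero variance on either side) are assumptions.

```typescript
// Minimal sketch of Pearson's r between human and judge scores.
// `pearson` and its null-handling are illustrative assumptions.
function pearson(xs: number[], ys: number[]): number | null {
  if (xs.length !== ys.length || xs.length < 2) return null;

  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let cov = 0;  // sum of products of deviations
  let varX = 0; // sum of squared deviations of xs
  let varY = 0; // sum of squared deviations of ys
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }

  const denom = Math.sqrt(varX * varY);
  // r is undefined when either series is constant.
  return denom === 0 ? null : cov / denom;
}
```

Returning `null` rather than `NaN` lets a UI distinguish "not enough data yet" from a real correlation of zero, which matters while the human-scored backlog is still small.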
Behavior changes
Adds /api/human-scoring/rubrics, /api/human-scoring/next, and /api/human-scoring/scores server routes.
Validation
./scripts/fast-feedback.sh passed (not run for PR creation).
bun run ci passed, or not required for this change (not run for PR creation).
Targeted validation run:
bun test tests/unit/persistence/human-scoring.test.ts tests/integration/server/human-scoring.test.ts
Result: passed, 7 tests.
Screenshots / video
N/A for this PR creation pass.